Abstract
This assignment looks for underlying connections between variables in the student alcohol use dataset. The null hypothesis of the study, that no significant categories can be found in the data, could not be rejected: no discernible categories were identified using multiple correspondence analysis. Additional exploration of the data, first with graphical tools and then with logistic regression, did show that the education levels (secondary and higher versus lower) of the students’ mothers and fathers correlate, and that there is further correlation between the parents’ employment in the same sector.

Before we can start analysing the student data, some groundwork is needed to prepare the environment for the analysis and to import the data.
Much of the data preparation work towards better analysis possibilities with MCA has been done already in the data wrangling script.
We’ll start off by importing the libraries that provide many of the functions we’ll use in this assignment. Because the libraries merely provide the functions, and we’ll describe those when they’re used in the context of the data, I’ve included short descriptions of the libraries as code comments below.
library(FactoMineR) # This includes a lot of factor analysis stuff
library(factoextra) # Consider it, if you will, an extension to FactoMineR with more advanced plotting capabilities
library(ggplot2) # The do-all plotting library
library(dplyr) # Everyone needs to manipulate data
library(corrplot) # Here to provide more beautiful correlation coefficient plotting
library(tidyr) # Tidyr provides us data manipulation functions
The next bit is importing the data and taking a look at it.
load(file = "/Volumes/tuti/IODS-final/data/wrangled_students.Rdata")
dim(rawdata)
## [1] 382 31
The data has 382 observations of 31 variables. Let’s next look at the variables.
str(rawdata)
## 'data.frame': 382 obs. of 31 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : Factor w/ 2 levels "High","Low": 1 2 2 1 1 1 2 1 1 1 ...
## $ Fedu : Factor w/ 2 levels "High","Low": 1 2 2 2 1 1 2 1 2 1 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: Factor w/ 4 levels "1","2","3","4": 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : Factor w/ 4 levels "1","2","3","4": 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : Factor w/ 5 levels "1","2","3","4",..: 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : Factor w/ 5 levels "1","2","3","4",..: 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : Factor w/ 5 levels "1","2","3","4",..: 4 3 2 2 2 2 4 4 2 1 ...
## $ health : Factor w/ 5 levels "1","2","3","4",..: 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
## $ G3_quart : Factor w/ 4 levels "Q1","Q2","Q3",..: 1 1 2 4 2 4 2 1 4 4 ...
Further information on all of the original variables, as well as the origin of the data, is available from the source.
A short recap of the variables in the wrangled dataset:
Descriptions for unmodified variables are from the source, applied mutatis mutandis to the factorised variables. Variables created during data wrangling are described by the author.
Finally, let’s see how the observations are divided in the data. We’ll use both the classical summary function and a graphical overview. As we’ve got lots of variables, we also define a higher width and height for the graphics to display nicely.
It pays to take a graphical look at the variables. People may have different ways of making sense of data containing multiple variables, but a graphical representation usually works at least for finding out whether there are some variables that are so skewed that they would also skew any further analysis.
summary(rawdata)
## school sex age address famsize Pstatus Medu
## GP:342 F:198 Min. :15.00 R: 81 GT3:278 A: 38 High:230
## MS: 40 M:184 1st Qu.:16.00 U:301 LE3:104 T:344 Low :152
## Median :17.00
## Mean :16.59
## 3rd Qu.:17.00
## Max. :22.00
## Fedu Mjob Fjob reason nursery
## High:198 at_home : 53 at_home : 16 course :140 no : 72
## Low :184 health : 33 health : 17 home :110 yes:310
## other :138 other :211 other : 34
## services: 96 services:107 reputation: 98
## teacher : 62 teacher : 31
##
## internet guardian traveltime studytime failures schoolsup
## no : 58 father: 91 1:250 1:103 Mode :logical no :331
## yes:324 mother:275 2:103 2:190 FALSE:316 yes: 51
## other : 16 3: 21 3: 62 TRUE :66
## 4: 8 4: 27
##
##
## famsup paid activities higher romantic famrel freetime
## no :144 no :205 no :181 no : 18 no :261 1: 9 1: 18
## yes:238 yes:177 yes:201 yes:364 yes:121 2: 18 2: 62
## 3: 66 3:156
## 4:183 4:109
## 5:106 5: 37
##
## goout health absences G3 high_use G3_quart
## 1: 24 1: 46 Min. : 0.000 Min. : 0.00 Mode :logical Q1:100
## 2: 99 2: 43 1st Qu.: 0.000 1st Qu.: 8.00 FALSE:270 Q2:126
## 3:123 3: 83 Median : 3.000 Median :11.00 TRUE :112 Q3: 82
## 4: 82 4: 64 Mean : 5.319 Mean :10.39 Q4: 74
## 5: 54 5:146 3rd Qu.: 8.000 3rd Qu.:14.00
## Max. :75.000 Max. :20.00
gather(rawdata) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
From these overviews we can deduce the following things that are pertinent for further analysis:
The null hypothesis is that no significant categories can be established in the data. The alternative hypothesis is that the parents’ education levels are associated with the highest quartile of the final grade.
Having now acquainted ourselves with the contents of the dataset, it’s time to see whether there are any interesting structures hidden in the data.
We will use the quantitative (integer) variables (indices 3,28,29) as supplementary variables in the MCA, while we base the actual categorisation on the categorical variables. As before, we will mostly use wider and taller figures, as the amount of information we wish to display is quite large.
Our first task is to look at the summary of the MCA to see whether a low number of dimensions account for a high percentage of variance, which would indicate a significant finding in the analysis. In addition to the textual representation of the summary we will add a scree plot of the eigenvalues of the variables to determine a possible drop in the contribution levels to the variance.
mca <- MCA(rawdata, ncp = 5, quanti.sup=c(3,28,29), graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = rawdata, ncp = 5, quanti.sup = c(3, 28, 29), graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.109 0.086 0.065 0.063 0.061 0.060
## % of var. 5.557 4.390 3.319 3.206 3.124 3.066
## Cumulative % of var. 5.557 9.947 13.266 16.472 19.596 22.662
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.058 0.055 0.054 0.052 0.052 0.049
## % of var. 2.932 2.807 2.763 2.658 2.625 2.517
## Cumulative % of var. 25.594 28.401 31.165 33.823 36.448 38.965
## Dim.13 Dim.14 Dim.15 Dim.16 Dim.17 Dim.18
## Variance 0.048 0.045 0.044 0.043 0.042 0.041
## % of var. 2.432 2.298 2.249 2.199 2.145 2.096
## Cumulative % of var. 41.397 43.695 45.944 48.142 50.287 52.383
## Dim.19 Dim.20 Dim.21 Dim.22 Dim.23 Dim.24
## Variance 0.040 0.039 0.038 0.038 0.037 0.036
## % of var. 2.046 2.004 1.934 1.916 1.894 1.818
## Cumulative % of var. 54.429 56.433 58.366 60.283 62.177 63.995
## Dim.25 Dim.26 Dim.27 Dim.28 Dim.29 Dim.30
## Variance 0.035 0.034 0.033 0.033 0.031 0.031
## % of var. 1.761 1.722 1.681 1.669 1.597 1.578
## Cumulative % of var. 65.756 67.478 69.159 70.827 72.424 74.002
## Dim.31 Dim.32 Dim.33 Dim.34 Dim.35 Dim.36
## Variance 0.030 0.029 0.028 0.027 0.026 0.026
## % of var. 1.526 1.479 1.433 1.398 1.343 1.329
## Cumulative % of var. 75.527 77.006 78.438 79.837 81.180 82.508
## Dim.37 Dim.38 Dim.39 Dim.40 Dim.41 Dim.42
## Variance 0.025 0.024 0.023 0.023 0.022 0.021
## % of var. 1.271 1.203 1.194 1.172 1.120 1.067
## Cumulative % of var. 83.780 84.982 86.176 87.348 88.468 89.535
## Dim.43 Dim.44 Dim.45 Dim.46 Dim.47 Dim.48
## Variance 0.020 0.020 0.019 0.019 0.018 0.016
## % of var. 1.037 1.034 0.970 0.949 0.926 0.836
## Cumulative % of var. 90.572 91.606 92.575 93.525 94.451 95.287
## Dim.49 Dim.50 Dim.51 Dim.52 Dim.53 Dim.54
## Variance 0.015 0.015 0.015 0.014 0.013 0.012
## % of var. 0.774 0.762 0.757 0.697 0.652 0.609
## Cumulative % of var. 96.061 96.823 97.581 98.278 98.930 99.539
## Dim.55
## Variance 0.009
## % of var. 0.461
## Cumulative % of var. 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | 0.070 0.012 0.002 | -0.258 0.203 0.029 | -0.217 0.189
## 2 | 0.288 0.199 0.060 | -0.388 0.458 0.109 | -0.002 0.000
## 3 | 0.357 0.306 0.064 | -0.303 0.278 0.046 | -0.301 0.363
## 4 | -0.425 0.432 0.101 | -0.130 0.051 0.009 | 0.229 0.210
## 5 | -0.147 0.052 0.019 | -0.291 0.257 0.073 | -0.138 0.077
## 6 | -0.467 0.523 0.188 | 0.156 0.074 0.021 | 0.106 0.045
## 7 | 0.180 0.077 0.028 | -0.068 0.014 0.004 | -0.157 0.099
## 8 | -0.057 0.008 0.001 | -0.294 0.262 0.029 | -0.448 0.806
## 9 | -0.350 0.293 0.071 | -0.025 0.002 0.000 | 0.044 0.008
## 10 | -0.238 0.136 0.034 | 0.234 0.166 0.032 | 0.018 0.001
## cos2
## 1 0.020 |
## 2 0.000 |
## 3 0.046 |
## 4 0.029 |
## 5 0.017 |
## 6 0.010 |
## 7 0.022 |
## 8 0.068 |
## 9 0.001 |
## 10 0.000 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## GP | -0.104 0.318 0.093 -5.945 | 0.003 0.000 0.000
## MS | 0.891 2.718 0.093 5.945 | -0.023 0.002 0.000
## F | -0.051 0.044 0.003 -1.031 | -0.578 7.183 0.360
## M | 0.055 0.047 0.003 1.031 | 0.622 7.730 0.360
## R | 0.685 3.258 0.126 6.939 | -0.160 0.224 0.007
## U | -0.184 0.877 0.126 -6.939 | 0.043 0.060 0.007
## GT3 | 0.013 0.004 0.000 0.415 | -0.062 0.116 0.010
## LE3 | -0.035 0.011 0.000 -0.415 | 0.166 0.310 0.010
## A | -0.331 0.356 0.012 -2.147 | 0.101 0.042 0.001
## T | 0.037 0.039 0.012 2.147 | -0.011 0.005 0.001
## v.test Dim.3 ctr cos2 v.test
## GP 0.156 | -0.107 0.561 0.098 -6.103 |
## MS -0.156 | 0.914 4.795 0.098 6.103 |
## F -11.712 | -0.207 1.212 0.046 -4.183 |
## M 11.712 | 0.222 1.304 0.046 4.183 |
## R -1.616 | 0.519 3.131 0.073 5.257 |
## U 1.616 | -0.140 0.842 0.073 -5.257 |
## GT3 -1.979 | 0.015 0.008 0.001 0.463 |
## LE3 1.979 | -0.039 0.022 0.001 -0.463 |
## A 0.652 | -0.477 1.238 0.025 -3.092 |
## T -0.652 | 0.053 0.137 0.025 3.092 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## school | 0.093 0.000 0.098 |
## sex | 0.003 0.360 0.046 |
## address | 0.126 0.007 0.073 |
## famsize | 0.000 0.010 0.001 |
## Pstatus | 0.012 0.001 0.025 |
## Medu | 0.406 0.119 0.000 |
## Fedu | 0.351 0.056 0.000 |
## Mjob | 0.361 0.198 0.033 |
## Fjob | 0.124 0.088 0.061 |
## reason | 0.064 0.060 0.073 |
##
## Supplementary continuous variables
## Dim.1 Dim.2 Dim.3
## age | 0.268 | 0.046 | -0.018 |
## absences | -0.013 | 0.082 | -0.168 |
## G3 | -0.438 | 0.026 | 0.399 |
fviz_eig(mca, ncp = 10)
The MCA summary shows that the contributions of the dimensions to the variance are not especially high, with the first dimension contributing just ~5.6% and the second dimension only ~4.4%. Therefore the explanatory potential of the model is not very substantial, but we will nevertheless use the analysis to gauge what little we can extract from the modest contribution of the first two dimensions to the variance.
From the scree plot of the eigenvalues we can clearly see that the contribution to variance evens out after the first two dimensions. Hence limiting further analysis to two dimensions is justifiable.
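The slow accumulation of variance can also be checked numerically. Below is a minimal sketch that takes the per-dimension percentages of variance copied from the MCA summary above (the first 18 dimensions only) and finds where the cumulative share first reaches 50%:

```r
# % of variance for Dim.1-Dim.18, copied from the summary(mca) output above
pct <- c(5.557, 4.390, 3.319, 3.206, 3.124, 3.066, 2.932, 2.807, 2.763,
         2.658, 2.625, 2.517, 2.432, 2.298, 2.249, 2.199, 2.145, 2.096)
# First dimension at which the cumulative variance reaches 50%
which(cumsum(pct) >= 50)[1]  # → 17
```

In practice the same figures are available programmatically from `mca$eig`; the hard-coded vector is only here to keep the sketch self-contained.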
We will next look at the contributions of the variables, at value level, to the first two dimensions. By doing this we can already get a feeling of the groupings we may expect when we later look at the dimensions and the observations in a more graphical fashion.
vars <- get_mca_var(mca)
head(vars$cos2, 8)
## Dim 1 Dim 2 Dim 3 Dim 4 Dim 5
## GP 0.0927770952 6.348081e-05 0.0977595148 0.02426203 0.130303865
## MS 0.0927770952 6.348081e-05 0.0977595148 0.02426203 0.130303865
## F 0.0027895829 3.600425e-01 0.0459181801 0.01245582 0.031258784
## M 0.0027895829 3.600425e-01 0.0459181801 0.01245582 0.031258784
## R 0.1263757621 6.850096e-03 0.0725224026 0.03964565 0.120288337
## U 0.1263757621 6.850096e-03 0.0725224026 0.03964565 0.120288337
## GT3 0.0004516079 1.027673e-02 0.0005625846 0.05545224 0.001836966
## LE3 0.0004516079 1.027673e-02 0.0005625846 0.05545224 0.001836966
fviz_contrib(mca, choice = "var", axes = 1, top = 28)
In the contributions to the first dimension, we see the parental education variables dominating, together with failures.
fviz_contrib(mca, choice = "var", axes = 2, top = 28)
In the contributions to the second dimension, the sexes contribute much of the variance.
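The top contributors can also be read off programmatically instead of from the bar plots. The sketch below uses the category contributions (ctr) copied from the first ten rows of the MCA summary above; on the fitted model the full matrix would come from `get_mca_var(mca)$contrib`:

```r
# Category contributions (ctr) to the first two dimensions, copied from the
# first ten categories of the MCA summary above
ctr <- matrix(
  c(0.318, 0.000,   # GP
    2.718, 0.002,   # MS
    0.044, 7.183,   # F
    0.047, 7.730,   # M
    3.258, 0.224,   # R
    0.877, 0.060,   # U
    0.004, 0.116,   # GT3
    0.011, 0.310,   # LE3
    0.356, 0.042,   # A
    0.039, 0.005),  # T
  ncol = 2, byrow = TRUE,
  dimnames = list(c("GP", "MS", "F", "M", "R", "U", "GT3", "LE3", "A", "T"),
                  c("Dim.1", "Dim.2")))
# Top contributors per dimension
names(sort(ctr[, "Dim.1"], decreasing = TRUE))[1:3]  # → "R"  "MS" "U"
names(sort(ctr[, "Dim.2"], decreasing = TRUE))[1:3]  # → "M"  "F"  "LE3"
```

Even within these ten categories, rural address and the MS school lead dimension 1 while the sexes dominate dimension 2, matching what the `fviz_contrib` plots show.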
corrplot(vars$cos2, is.corr=FALSE)
fviz_mca_var(mca, choice = "mca.cor",
repel = TRUE, # Avoid text overlapping (slow)
ggtheme = theme_minimal())
fviz_mca_var(mca, choice = "quanti.sup",
ggtheme = theme_minimal())
As we can see, age increases along dimension 1 while the final grade decreases, making the two nearly opposites on that dimension; absences load only weakly on the first two dimensions.
fviz_mca_biplot(mca, repel = TRUE, ggtheme = theme_minimal())
fviz_mca_var(mca, col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE, # avoid text overlapping (slow)
ggtheme = theme_minimal()
)
fviz_mca_ind(mca, col.ind = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE, # Avoid text overlapping (slow if many points)
ggtheme = theme_minimal())
While the dimensions’ contributions to the total variance are modest, some of the variables’ values are relatively closely grouped together. For example, a high amount of free time seems to be closely matched with going out a lot, and both the mother’s and the father’s employment as teachers seem to coincide. Similarly, a low amount of study time appears close to high alcohol use.
If we select values whose coordinates are beyond ±0.5 on both dimensions, we end up with the following groupings in the four quadrants:
Top-left:
Top-right:
Bottom-right:
Bottom-left:
While the dimensions explain a low portion of the variance, the groupings can be tentatively used, for example, to identify students who would benefit from supportive measures in their education. We can also use other forms of analysis to measure whether some of the variables correlate with study results (as we have done in the next section).
One quite intuitive, graphical way to spot whether our variables actually have explanatory power within their dimensions is to plot the variables in the two-dimensional space with concentration ellipses.
fviz_ellipses(mca, "failures", geom = "point")
fviz_ellipses(mca, "Medu", geom = "point")
fviz_ellipses(mca, "Fedu", geom = "point")
fviz_ellipses(mca, "Mjob", geom = "point")
fviz_ellipses(mca, "Fjob", geom = "point")
fviz_ellipses(mca, "high_use", geom = "point")
It is obvious from the lack of individuals within most of the ellipses that the variables are not very representative by themselves in the two-dimensional space.
To wrap things up, we will resort to a factor map of the values of variables.
plot(mca, habillage = "quali", invisible=c("ind"))
As is abundantly evident by now, the multiple correspondence analysis failed to explain more than 10% of the variance with the first two, easily graphically representable, dimensions, and as we saw from the MCA summary at the early stage of our analysis, the cumulative contribution to the variance did not cross the 50% mark until the 17th dimension. Hence the original and constructed factor variables cannot be categorised in a statistically significant manner.
As we did not find any significant categories within the data, we fail to reject the null hypothesis.
Why stop with MCA when you’re having fun? While exploring the variables before the further analysis, some unanswered questions about the connections between the variables cropped up. Let’s see if we can shed some light on the families of the students.
plot(rawdata$Mjob~rawdata$Fjob, xlab="Father's occupation", ylab="Mother's occupation", main="Occupational homogeneity", cex = 0.5)
plot(rawdata$Medu~rawdata$Fedu, xlab="Father's education level", ylab="Mother's education level", main="Educational homogeneity", cex = 0.5)
From these standard plots we can deduce that the parents’ education levels correlate quite heavily. We can try to confirm this through logistic regression.
summary(glm(Fedu ~ Medu, data = rawdata, family = binomial))
##
## Call:
## glm(formula = Fedu ~ Medu, family = binomial, data = rawdata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8793 -0.7623 -0.7623 0.6125 1.6599
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0871 0.1518 -7.159 8.11e-13 ***
## MeduLow 2.6652 0.2635 10.113 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 529.05 on 381 degrees of freedom
## Residual deviance: 398.86 on 380 degrees of freedom
## AIC: 402.86
##
## Number of Fisher Scoring iterations: 4
From the summary we can see that the mother’s low education level is a strong predictor of the father’s education level; the two correlate quite heavily.
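The `MeduLow` coefficient is on the log-odds scale; exponentiating it gives an odds ratio, which makes the strength of the association easier to grasp. A quick sketch using the estimate and standard error printed in the summary above:

```r
# Estimate and standard error for MeduLow, copied from the glm summary above
b  <- 2.6652
se <- 0.2635
exp(b)                        # odds ratio, roughly 14.4
exp(b + c(-1.96, 1.96) * se)  # approximate 95% Wald CI, roughly 8.6 to 24.1
```

In other words, a mother with a low education level multiplies the odds of the father also having a low education level by roughly fourteen, which supports the impression from the plots.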
We would also like to find out whether the father’s or mother’s employment to a certain sector predicts the student’s final grade. For this we analyse both a graphical representation of the final grades for each sector and then compile a linear regression with the final grade as the dependent variable.
qplot(G3, data = rawdata, facets = Fjob~., geom = "freqpoly", binwidth = 1, xlab = "Final grade", ylab = "Students", main="Father's job and final grade")
summary(lm(G3 ~ Fjob, data = rawdata))
##
## Call:
## lm(formula = G3 ~ Fjob, data = rawdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.7097 -2.1896 0.6916 3.2802 9.6916
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.7500 1.1715 8.322 1.6e-15 ***
## Fjobhealth 1.7794 1.6323 1.090 0.276
## Fjobother 0.4396 1.2151 0.362 0.718
## Fjobservices 0.5584 1.2561 0.445 0.657
## Fjobteacher 1.9597 1.4425 1.359 0.175
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.686 on 377 degrees of freedom
## Multiple R-squared: 0.01097, Adjusted R-squared: 0.0004729
## F-statistic: 1.045 on 4 and 377 DF, p-value: 0.3837
qplot(G3, data = rawdata, facets = Mjob~., geom = "freqpoly", binwidth = 1, xlab = "Final grade", ylab = "Students", main="Mother's job and final grade")
summary(lm(G3 ~ Mjob, data = rawdata))
##
## Call:
## lm(formula = G3 ~ Mjob, data = rawdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.0606 -1.7609 0.2419 3.2391 10.1509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.8491 0.6330 13.980 < 2e-16 ***
## Mjobhealth 3.2115 1.0219 3.143 0.00181 **
## Mjobother 0.9118 0.7447 1.224 0.22156
## Mjobservices 2.4739 0.7886 3.137 0.00184 **
## Mjobteacher 1.9090 0.8621 2.214 0.02740 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.608 on 377 degrees of freedom
## Multiple R-squared: 0.04355, Adjusted R-squared: 0.0334
## F-statistic: 4.291 on 4 and 377 DF, p-value: 0.002086
qplot(G3, data = rawdata, facets = Medu~Fedu, geom = "freqpoly", binwidth = 1, xlab = "Final grade", ylab = "Students")
summary(lm(G3 ~ Medu + Fedu, data = rawdata))
##
## Call:
## lm(formula = G3 ~ Medu + Fedu, data = rawdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.1489 -2.1489 0.7094 3.1140 9.7094
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.1489 0.3371 33.068 <2e-16 ***
## MeduLow -1.5955 0.5852 -2.726 0.0067 **
## FeduLow -0.2629 0.5733 -0.459 0.6468
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.619 on 379 degrees of freedom
## Multiple R-squared: 0.03391, Adjusted R-squared: 0.02881
## F-statistic: 6.651 on 2 and 379 DF, p-value: 0.001449
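The coefficients above translate directly into predicted mean final grades for the four combinations of parental education. A small sketch with the estimates copied from the summary (`High` is the reference level for both factors):

```r
# Coefficients copied from the lm(G3 ~ Medu + Fedu) summary above
intercept <- 11.1489   # Medu = High, Fedu = High
medu_low  <- -1.5955
fedu_low  <- -0.2629

# Predicted mean G3 for each Medu/Fedu combination (additive model)
preds <- c(High_High = intercept,
           Low_High  = intercept + medu_low,
           High_Low  = intercept + fedu_low,
           Low_Low   = intercept + medu_low + fedu_low)
round(preds, 2)  # → 11.15, 9.55, 10.89, 9.29
```

The roughly 1.9-point gap between the High/High and Low/Low groups is driven almost entirely by the mother’s education, consistent with the significance levels in the summary.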
Finally, we’ll check graphically whether there seems to be any correlation between the number of absences and the students’ final grade quartiles.
qplot(absences, data = rawdata, facets = G3_quart~., geom = "freqpoly")
As most students have no absences regardless of their quartile, it’s hard to determine any substantial correlation between absences and final grade from the graphical representation. The fourth-quartile panel suggests, though, that the students with the highest final grades did not have a substantial number of absences.